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Abstract 

The estimation of population allele frequencies using sample data forms a central component of studies in population 
genetics. These estimates can be used to test hypotheses on the evolutionary processes governing changes in genetic 
variation among populations. However, existing studies frequently do not account for sampling uncertainty in these 
estimates, thus compromising their utility. Incorporation of this uncertainty has been hindered by the lack of a method for 
constructing confidence intervals containing the population allele frequencies, for the general case of sampling from a finite 
diploid population of any size. In this study, we address this important knowledge gap by presenting a rigorous 
mathematical method to construct such confidence intervals. For a range of scenarios, the method is used to demonstrate 
that for a particular allele, in order to obtain accurate estimates within 0.05 of the population allele frequency with high 
probability (>95%), a sample size of >30 is often required. This analysis is augmented by an application of the method to 
empirical sample allele frequency data for two populations of the checkerspot butterfly {Melitaea cinxia L), occupying 
meadows in Finland. For each population, the method is used to derive >98.3% confidence intervals for the population 
frequencies of three alleles. These intervals are then used to construct two joint >95% confidence regions, one for the set 
of three frequencies for each population. These regions are then used to derive a >95% confidence interval for Jost's D, a 
measure of genetic differentiation between the two populations. Overall, the results demonstrate the practical utility of the 
method with respect to informing sampling design and accounting for sampling uncertainty in studies of population 
genetics, important for scientific hypothesis-testing and also for risk-based natural resource management. 
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Introduction 

Spatiotemporal patterns of genetic variation among populations 
are often used to test hypotheses about processes underlying the 
patterns, such as selection, migration and genetic drift (e.g., 
[1,2,3,4,5]). This genetic variation captures differences in genetic 
structure among populations, where the genetic structure of a 
population is determined by the distribution of alleles among 
individuals in the population. The allele distribution of a 
population is the result of a range of biological and environmental 
processes acting on the population and also on surrounding 
populations within the same geographical region, which result in 
non-random mixing of gametes among individuals in all popula- 
tions. Patterns in allele distributions among populations are 
generally assessed by consideration of variations in allele 
frequencies (e.g., [6,7,8,9]). 

Logistic and ethical constraints mean that in practice, a 
population is unlikely to be sampled in its entirety, such that the 



population allele frequencies have to be estimated using a subset of 
sampled individuals. For diploid organisms in a large population, it 
has been established that the frequency of an allele in a sample 
provides an unbiased estimate of the frequency in the population 
as a whole [10]. Thus, if samples of a given size are repeatedly 
taken from a large population, then the mean frequency of an 
allele in a sample converges to the population allele frequency as 
the number of samples increases. However, many studies only take 
a single sample from a population and present or use the resulting 
frequencies of alleles in the sample, without accounting for 
sampling uncertainty (e.g., [11,12,13,14,15,16,17,18]). Therefore, 
these studies implicitly assume that the sample allele frequencies 
are close to the population allele frequencies, which is by no means 
guaranteed. Interpretation of findings from these studies is 
therefore complicated by the potential for large sampling 
uncertainty. Ideally, in this single sample case, uncertainty bounds 
for the population allele frequencies would be quantified, based on 
the sample allele frequencies. 
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A recent study by Hale et al. [19] used computer simulations to 
draw samples from four diploid populations, with allele frequen- 
cies based on four empirical datasets. Subsequendy, they used the 
sample data to derive the means, variances and ranges of allele 
frequencies in samples of varying size, as well as the means, 
variances and ranges of indicators that are a function of allele 
frequencies (specifically, heterozygosity and F ST ). Using these 
results, Hale et al. [19] concluded that a sample size of 30 is 
sufficient to accurately estimate population allele frequencies when 
using microsatellites. However, the authors did not quantify 
uncertainty bounds for the population allele frequencies using 
their sample data, such that their assessment of accuracy lacks a 
rigorous quantitative basis. Furthermore, for each sample size, 
sample frequencies were calculated using only 100 replicate 
samples, which may not give a good approximation of the true 
dispersion in the sample frequencies. This highlights a weakness of 
using a computational approach that lacks an underlying 
mathematical theory. Moreover, Hale et al. [19] did not consider 
the situation identified earlier, where one sample is taken and 
there is a need to quantify uncertainty of population allele 
frequencies using just the sample allele frequencies. For this 
situation, earlier studies have derived confidence intervals in order 
to capture uncertainty in the population allele frequencies. If a 
diploid population is of infinite size and is at Hardy-Weinberg 
equilibrium (HWE), then the frequency of an allele in a sample 
follows a binomial distribution [10]. Thus, Gillespie [20] proposed 
the use of the Wald confidence interval [21]. However, this 
interval only performs well for sufficiently large sample sizes — with 
small sample sizes, it is too short [22]. For the binomial 
distribution, other confidence intervals have been derived that 
do perform well for small sample sizes [22,23], notably the 
Clopper- Pearson interval that was derived eight decades ago [24] . 
However, these only apply in the limited cases when the 
population is at HWE. Weir [10] proposed a confidence interval 
that allows for deviations from HWE, analogous to the Wald 
confidence interval. However, like the latter, it is expected to 
perform well only for sufficiently large sample sizes [10]. This is 
problematic because the accuracy at a particular sample size is 
unknown, so "sufficiendy large" cannot be rigorously quantified. 
Moreover, all the confidence intervals considered only apply to 
cases when the population can be assumed to be of infinite size, i.e. 
when the population size is much larger than the sample size. This 
does not reflect the range of scenarios encountered in empirical 
research (e.g., [13,25,26,27]). 

In this study, we build on previous work by constructing 
confidence intervals for population allele frequencies for the 
general case where (i) the population is diploid, finite and can be of 
any size; (ii) the sample can take any size less than or equal to that 
of the population; and (iii) the population can deviate from HWE 
to any extent. These confidence intervals are guaranteed to 
contain the population allele frequency with a probability above a 
known threshold. The method derived for constructing these 
intervals is then used to calculate sample sizes required to achieve 
accurate estimates of population allele frequencies, under a range 
of scenarios. Here, accuracy is measured as the length of the 
confidence intervals. The sample sizes derived serve as a guide for 
determination of suitable sample sizes in future population genetic 
studies. In particular, we show that a sample size of 30 does not 
necessarily give accurate estimates, thus refining the conclusion of 
Hale et al. [19]. Lastly, we provide an example of how the method 
can be applied to microsatellite data for two populations of the 
checkerspot butterfly (Melitaea cinxia L.) [13], to derive confidence 
intervals for population allele frequencies and also for Jost's D, a 
measure of genetic differentiation between two populations that is 



a function of their population allele frequencies [9]. The data 
comes from a study [13] where it is unclear that the populations 
can be assumed to be of infinite sizes or at HWE. This example 
illustrates how the mathematical theory underlying our method 
can be used to quantify sampling uncertainty not only in 
population allele frequencies, but also in parameters that are 
functions of these frequencies, important for hypothesis-testing and 
also for natural resource management. 

Overall, this study provides a rigorous mathematical quantifi- 
cation of sampling uncertainty in population allele frequencies, 
without the typically unrealistic constraints of assuming that a 
population has an infinite size and is at HWE. Thus, the results 
can be applied to a wide range of studies in population genetics. 

Methods 

In order to construct confidence intervals for the general case of 
taking samples of any size from a diploid population of any size 
(larger than or equal to the sample size) and with any degree of 
deviation from HWE, the sampling distribution of the allele 
frequencies in this general case is first exactly specified. This 
distribution is then used to derive formulae specifying confidence 
intervals that contain the population allele frequencies with 
probability above a known threshold. Using these formulae, 
confidence intervals are derived for a range of archetypal 
scenarios, which are then used to calculate sample sizes that 
permit accurate estimates of population allele frequencies in these 
scenarios. In addition, the formulae are applied to a real scenario 
where samples are taken from two butterfly populations [13], to 
construct confidence intervals for the population allele frequencies 
of these two populations. The intervals are then used to derive a 
corresponding confidence interval for Jost's D [9]. 

Derivation of sampling distribution of allele frequencies 

Consider a population of M diploid individuals, from which a 
sample of N individuals is randomly drawn (M>N). At the locus 
of interest, there are n> 1 alleles, denoted by A,, i e {1, 2, ... ,«}. 
Let the population allele frequencies be denoted by />,, with the 
corresponding sample allele frequencies being denoted by Pi,N- 
Also, let Py be the frequency of individuals in the population with 
alleles A< and Aj {i,je{\,2, ...,«}), such that MPy is the 
number of corresponding individuals in the base population. /), is 
related to Py by the formula Pi = P,, + 2~2j=l j^i { P ij/^)- Here, 
P„ is a measure of the homozygosity of individuals in the 
population with respect to allele A-,. Pu=Pi— 2~2j=i j^i { p ii/2), 
and thus Pa<pi. If 0</>,<0.5, then the minimum value of Pa is 
0, since all copies of allele Aj can be distributed among 
heterozygotes of allele A,. However, if />,->0.5, then this is not 
possible - there must be at least one homozygote of allele Aj. In 
this case, the minimum number of homozygotes is realized when 
all heterozygotes have a copy of allele At, i.e. when 
P ii + Y%=lj*i P v = l- Rearranging this for YT]=\,]^i P ij and 
substituting into Pu=Pi — 2~2j=l j^i { P ijl^) gi ves the minimum 
value of Pa as 2/7,-1. Thus, overall, Max{0, 2/>, — 1} <P,, </>/• 

The frequency of allele Aj in the sample of size N is given by 
Pi,N= Yi^/(2N), where Yj^ is the number of copies of allele Aj 
in the sample. Thus, the probability distribution of /? !; jv is the same 
as the probability distribution of Yjjj except with the x-axis scaled 
by a factor 1 / (2N). Denote the probability mass function (pmf) for 
Y UN by P(Y UN =y UN ). The range of Y i>N is [0, 2N], so 
p (Yijf=ytjf)=0 for j/^jv outside this range. Thus, for the 
following calculations, only yjjj £ [0, 2N] axe considered. Now, 
Y it ff = 2Xn + Xj-, where Xj- = Yl" = j Xy and Xy is the number 
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of individuals in the sample with alleles Aj and Aj. Xa, Xj- and 
Zi = N — Xa — Xj- thus represent the number of individuals in the 
sample with two copies, one copy and no copies of allele Aj, 
respectively. X u , Xj- and Z, follow a multivariate hypergeometric 
distribution with pmf given by: 
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(1) 



where terms on the right-hand side in brackets are binomial 
coefficients. Since the number of individuals in the sample must 
equal N, xu + Xj+Zj = N. P{Y i< M=yi,N) is the sum of 
P(Xjj = Xn,Xi-=Xi-,Zj = Zi) for all those biologically feasible 
combinations of x„, Xj- and z,- satisfying = 2xu + Xj- . It will 
now be shown that this summation can be simplified to the sum of 
an expression that depends only on x„ and yi,N, over all Xa 
between lower and upper bounds that only depend on y,^. This 
considerably eases calculation of P{Yi,N =yi,N)- Firstly, the 
number of individuals in the sample with two copies, one copy 
or no copies of allele A, cannot exceed the corresponding numbers 
in the sampled population. This gives rise to three inequalities: 



x„<MP„, 



(2) 



(inequality (6)) ensures that xu<yi^ /2<N, as required. Denote 
the upper and lower bounds in (8) by L(yj^) and t/(_yyv) 
respectively. Then P(Yj^=yj^) can be written as: 

P(Yi,N=yi,N) 

tioot[u(y iyN )] 

= / ] P(Xu = xn, Xj- =Xj-, Zj =z,) 

x, Y =ceiling[Z.(>., ;A ,)] (9) 



floor^)] ^ A - 7=V „, A> = ,.. A _, v „ 
x a =ceUmg[L(y iN )] 



Zi = N + Xjj-yi,N 



where z, has been rewritten using Zj = N — Xu — X/- and 
x i-=yi,N — 2 x n> ceiling[a] is the smallest integer larger than a, 
and floor [a] is the largest integer smaller than a. The ceiling and 
floor functions are introduced to ensure that Xu is an integer, 
which it must be to have a biological interpretation. P( F/ jv =yi t N) 
is equal to P(fi,N =yi,N /(2iV)), me P™f for Pi,N, and thus defines 
the probability distribution of Equation (9) can be used to 
define the probability distribution of given only four 

parameters: M, p t , Pa and N. 

Derivation of confidence intervals for sample allele 
frequencies 

For given population size (M), population allele frequency (pi), 
population frequency of homozygotes with allele Aj (J? ff ) and 
sample size (N), the probability distribution for the sample allele 
frequency (p;,jv) can De calculated exactly using equation (9). The 
mean value of this distribution is: 



«,-<M^ P v , 

),)*i 



(3) 



N - x a - x r <M — MP n -MY^ p ij- ( 4 ) 

Secondly, the number of individuals with two copies, one copy or 
no copies of allele Aj in the sample must be non-negative. This 
gives rise to three more inequalities: 



Xjj>0, 



(5) 



E\PiM =E 



Yj,N 

2N 



2E[X u ]+E[X t ] 
2N 
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2N 



as in the case of sampling from an infinite diploid population [10]. 
In equation (10), the expectation = NPjj has been used, 

which follows from the fact that the Xj/s, with i,j e {1, 2, ... ,«}, 
follow a multivariate hypergeometric distribution with parameters 
M, N and MPjj [28]. In addition, it can be proved that the 
variance of is 



Xj>0, 



N — Xj, ■ — Xj- >0. 



(6) 



(7) 



Since Pi = Pu+YTj=ij^i{ p i)l' 1 ) and yi,N = 2xjj + x f , then 
J2j=i,j^iPij = 2(Pi — Pn) ancl Xj-=y itN — 2xjj. Thus, the inequal- 
ities (2)-(7) can be rearranged to obtain the double inequality: 



(8) 



MaX {¥ " Mpi + MP '" y, ' N ~ N ' °) - Xii ~ 
Mm^MPjj, M-N+yw-lMpi + MPu}. 

It is noted that Xj- =y t ^ — 2xjj, which means that Xj- >0 



(M-N){ P j~2pj + P„) 
2{M-\)N 



(11) 



(see File SI). The variance thus includes a standard finite 
correction factor (M — N) / (M — 1) that tends to 1 as M->oo, 
as required. Therefore, as M— >co, 

f [P;',iv] -»(pi — 2pj +Pji) /(2N), the variance in the case of an 
infinite population size [10]. The cumulative distribution function 
(cdf) for pi t N is specified by: 



P(j>j,N<w) =P( P ,,N<—} 



2N) 

y,,N (12) 
-P(Yj^<y i , N )=Y t P(Y i j, = k), 
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where y^N is an integer in the interval [0, 2N] . This cdf can be 
calculated using equation (9) and is a function ofp,-. To construct a 
CI for pi given M, JV and P„, consider testing the null hypothesis 
Ho '■ Pi = Pi,o against the alternative Hi : /?, # /),;o at significance 
level a for an observed value of Y,,at, denoted by ytjj. The null 
hypothesis is not rejected if using Pi=P/fl, yyv falls within an 
acceptance region defined by B x = [yijf, a /2,yij/,i-( tt /2)] , where 
yi,N,x/2 i s the largest integer for which P( Y/^j < y,\N,x/2) ^ a /2 and 
yi,N,l-(x/2) is the smallest integer for which 
P(Yi,N>yi,N,l-(x/2)) <a/2. B a can be calculated using equation 
(12). One method of constructing a > 100(1 — a) % CI for is to 
determine the set of values of />,,o for which the null hypothesis is 
not rejected at significance level a, denoted by 
Pi,H 0 ,* = {Pi,0 ■ Pi,N e B a ), and defining the CI as 

[L„ U„] = [Min(p, > fl b>( ,),Max(p t fl b>( ,)] (13) 

[29]. This hypothesis-testing approach corresponds to the "test- 
method" described by Talens [30], who applied it to construct 
CI's for parameters of the univariate hypergeometric distribution. 
If Pi,H 0 ,x = 0> the empty set, then the acceptance region needs to 
be extended to include j>;,jv for at least one value otpi$. Thus, in 
this case, a is decreased until j>,yy is in the acceptance region for at 
least one /> !; o value. 

The CI specified by equation (13) is not an exact 100(1 — oc)% 
CI because the distribution for F^jv is discrete and because there 
may be some p,fi values between Min(P,-^ 0 a ) and Max(P J // 0>a ) 
for which y^ £ B a , since P( F,;jv <yiji) is not guaranteed to be a 
monotonic function of />, (see equations (9) and (12)). It is noted 
that for the probability parameter in a binomial distribution, 
Clopper and Pearson [24] derived > 100(1— a) % CI's using an 
analogous method. Thus, for the case of a large population size 
relative to the sample size (M> >N) and HWE, where pi^ 
approximately follows a binomial distribution, CI's for />,- derived 
using our method would be virtually the same as Clopper-Pearson 
CI's. This can be verified by explicitly calculating and comparing 
CI's using the two methods - for example, if 
M= 1,000,000 > > A = 30 and y hN = 10, then both methods give 
the CI [0.083, 0.285] for Pi . 

In deriving equation (13), it was assumed that P„ is known. 
However, P,-,- is generally unknown. In this case, a > 100(1 — a)% 
confidence region (CR) can be derived for />, and P„; in an 
analogous way, by considering the null hypothesis 
H'o ■ Pi=Pi,o, Pu = Pu,o- The CR can be derived by determining 
the set of vectors (p ;> o, Pa,o) for which the null hypothesis is not 
rejected at significance level a. This set is denoted by 
Pi,H' 0 ,* = {(Pi,o, P«,o) : Pi y N e B *}, wn ere B a is as defined before. 
Using the CR, > 100(1 — a) % CI's for p t and P„ can be defined as 

[L' p ., a , U' P . A ] = [Min (Pyr 0 ,«,i) , Max (p^ 0>a ,i ) ] , (14a) 

and 

[L' PaA , U' PihX ] = [Min (Pvr 0 jJ) , Max (p,, ffo , a , 2 ) ] , (14b) 

where Pi,H' 0 ,mj is the set of values consisting of the jth elements in 
the set of vectors Pi,H' 0 ,ii- In determining Pi t H' 0 ,x> Pi,o and P/^o 
values over the biologically feasible ranges are tested. Since the 
number of copies of allele A t in the population must be at least the 
number found in the sample, y iN /(2M) <Pi$. Also, the number 



of copies of all other alleles in the population must be at least the 
number found in the sample, IN—y^; this constrains the 
maximum value of />,;o according to 

p U 0<[2M-(2N-y iiN )]/(2M) = l-\(2N-y m )/(2M)]. For 
a given value of/>,;o, Max{0, 2/7,;o — 1 } < Pft',0 ^ Pi.o, as determined 
earlier. If P^h' 0 ,x = 0> then the acceptance region is extended to 
include j>,yy for at least one pair of values of />,;o and P, ;> o, by 
decreasing a. 

CI's specified by equation (13) can be calculated given M, N, Pa 
and a, whereas those specified by equations (14a) and (14b) can be 
calculated given just M, N and a. In this paper, computation of 
any CI's using these equations, incorporating the underlying 
formulae specified by equations (1), (8), (9) and (12), was carried 
out using the software package Mathematica v5.0 [31]. Supporting 
Webpage 1 provides Mathematica code for computation of the CI's 
and can be viewed at http://rpubs.com/kkeenan02/Fung- 
Keenan-Mathematica/. However, other software packages such 
as MATLAB [32] and R [33] could also be used to implement the 
formulae; indeed, an R version of the code is provided on 
Supporting Webpage 2 and can be viewed at http://rpubs.com/ 
kkeenan02/Fung-Keenan-R/. All source code for the composition 
of Supporting Webpages 1 and 2, including the raw Mathematica 
and R code used, can be accessed at https://github.com/ 
kkeenan02/Fung-Keenan2013/. 

Determining minimum sample sizes for accurate 
estimation of population allele frequencies 

In studies of population genetics, it is desirable to obtain a 
sample allele frequency close to the population allele frequency 
with high probability [19], i.e. a short CI of high probability. 
Thus, under three representative scenarios, we calculate the 
minimum sample sizes required to obtain >95% CI's with lengths 
<0.2 and <0.1 across all possible values of the observed sample 
allele frequency, pj^ =y iN / (2N). These minimum sample sizes 
are denoted by iV<o.2 and iV<o.i respectively. They are calculated 
by starting with A" =10, computing CI lengths for all possible 
values of p^N and then taking the maximum value. This is 
repeated for increasing N in increments of 1 0 until the maximum 
CI length becomes <0.1. The N value with maximum CI length 
closest to 0.2 is then chosen, and then increased or decreased as 
necessary to find A<o.2- A<o.i is derived analogously. Maximum 
CI lengths of 0.2 and 0.1 represent small maximum absolute errors 
in the estimated allele frequency of 0.1 and 0.05 respectively, if the 
estimate is taken as the mid-point of the CI. The first scenario 
examined, Scenario 1 , is sampling from a population of M = 1 ,000 
at HWE. M= 1,000 is larger than or on the same order of 
magnitude as the upper bound of the range of size estimates for 
15/24 (63%) species populations collated by Frankham et al. [25], 
covering mammals, birds, insects and plants. Thus, M= 1,000 is 
taken to represent a large population. Later, M is varied over two 
orders of magnitude to consider populations that range from small 
to very large. Since there is a HWE, Pu=p\. Also, in the sampled 
population, the number of homozygotes with allele i, MP a, must 
be an integer. This restricts the number of values that pi can take 
to 1 1 equally spaced values in the interval [0, 1] (starting from 0). 

Scenarios 2 and 3 represent situations where HWE does not 
hold, such that Pujtpj. In Scenario 2, Pa is assumed to take its 
minimum value, which is Max{0, 2pi — 1}, whereas in Scenario 3, 
P„ is assumed to take its maximum value, which is />,. In 
comparison with Scenario 1 , pt can now take a larger number of 
values - it can take the 2,001 equally spaced values in the interval 
[0, 1] in Scenario 2 and the 1,001 equally spaced values in the 
interval [0, 1] in Scenario 3. Since the variance of is an 
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increasing function of P„ (equation (1 1)), CI's derived using />,yy 
are expected to increase in length with P u as well. This means that 
CI lengths and minimum sample sizes (N<o.2 and iV<o.i) derived 
for Scenarios 2 and 3 are expected to encompass the entire ranges 
of possible values. 

In the three scenarios examined, P„ is specified as a function of 
Pi, such that equation (13) can be used to calculate a CI for p t 
given a value otpj,N- Calculation of this CI involves considering all 
possible values of pi and determining under which values ptji falls 
within the corresponding acceptance region (as described above). 
An alternative scenario, Scenario 4, is that the relationship 
between P„ and pi is unknown. In this case, equation (14a) has to 
be used to calculate a CI for />,- given a value of p^, which 
involves considering all possible values of/), and P„. This results in 
a considerable increase in computation time; for example, when 
sampling A = 30 individuals from a population of M= 1,000, 
2,001 values of /), need to be considered but 1,002,001 
combinations of />,■ and P„ need to be considered, representing 
an increase in computational time by two orders of magnitude. 
However, for a given value of/),-, the maximum value of Pa gives 
the highest variance for /?,-# and is thus expected to maximize the 
length of the acceptance region within which /?, jv could fall within. 
Thus, CI's for pi derived under Scenario 4 are expected to closely 
match those derived under the case of maximum homozygosity 
(Scenario 3). This can be verified by explicit calculation of the CI's 
- for example, given M=100, A = 30 and = 15, the CI's 
derived under Scenario 4 and Scenario 3 are [0.14, 0.4] and 
[0.15, 0.39], representing a difference in length of only 0.02. 
Similar results hold for />, = 10 and 5. Therefore, results from 
Scenario 3 are used to approximate those for Scenario 4, obviating 
the need for long computational runtimes to explore all possible 
combinations of/)/ and Pu. 

Lastly, from equation (1 1), the variance of />,-,at is an increasing 
function of M. Thus, CI lengths for pi are expected to increase 
with M, which would result in increases in A<o.2 and iV<o.i- To 
test this explicitly, for a population at HWE and with A = 30, 
maximum lengths for the > 95 % CI for pt (across all values ofp,yy) 
are calculated for M=100, 250, 500, 750, 1,000, 2,500, 5,000, 
7,500 and 10,000. This range corresponds to populations that are 
small to very large [25]. 

Application to an empirical data set for checkerspot 
butterflies 

To demonstrate how the theory developed can be used in 
practice, it is applied to microsatellite data for samples from two 
populations of the checkerspot butterfly (Melitaea cinxia L.) 
occupying meadows on the Aland Islands in Finland [13]. 
Specifically, for the CINX1 locus, > 100(1- 2a) % CI's are 
calculated for the frequencies of alleles A, B and C for the Prasto 
and Finstrom populations, using the corresponding sample allele 
frequencies (Table 1 of Palo et al. [13]). For each population, a CI 
is not calculated for the population frequency of the fourth and 
final allele D, because this is fixed by the population frequencies of 
the first three alleles. To achieve consistency with the notation 
used in our study, henceforth, alleles A, B, C and D are referred to 
as alleles A\, A2, A3 and A4 respectively. The sample size for the 
Prasto population is N\ = 53, whereas that for the Finstrom 
population is A 2 = 74 [13]. />,-,#, and qi,N 2 , 16 {1,2,3,4}, are used 
to denote the sample allele frequencies of A, in the Prasto and 
Finstrom populations respectively. Similarly, />, and qi are used to 
denote the population allele frequencies of A, in the two 
populations, respectively. The two populations consist of two 
and seven subpopulations respectively. Thus, technically, the two 
populations can be referred to as metapopulations, although this 



terminology is not used in our study for clarity. The subpopula- 
tions form part of a total of about 536 subpopulations on the 
Aland Islands, with an estimated total size ranging from 35,000 to 
at least 200,000 [13]. Thus, the size of each subpopulation is 
assumed to be [(35,000 + 200,000)/2]/536 = 219, such that the 
Prasto and Finstrom populations are assumed to have a size of 
Af 1= 2x219 = 438 and M 2 = 7 x 219 = 1533 respectively. The 
observed sample allele frequencies in the two populations, denoted 
by /? ;> jv, and q^ff 2 respectively, are taken directly from [13]. These 
are used to calculate the observed number of copies of allele Ai in 
each population, denoted by y, j\r, and z, /y 2 respectively, using the 
formulae j^yv, = 2N\p i ^ l and z, jv 2 = 2A r 2<7,,jV 2 ■ Due to rounding 
error in />,■#, and q^ 2 values from [13], j'yy, and z,yv 2 had to be 
rounded to the nearest integer. Since the true homozygosity of 
each allele in each population is unknown [13], conservative CPs 
are calculated assuming P„ takes its maximum value of />,- (this is 
expected to maximize the lengths of the CI's, as explained in the 
previous section). In addition, a = 0.025/3 is chosen, such that a 
>98.3% CI is derived for each of the three population allele 
frequencies in each of the two populations. The reason for this 
choice of a is because for each population, using the Bonferroni 
Inequality [34], the cubic region defined by the three CI's can be 
taken as a > 100(1 —2/?)% confidence region (CR) for the three 
population allele frequencies, where ?>a>/}. The choice of 
a = 0. 025/3 allows >95% CR's to be derived for each set of 
three population allele frequencies in the two populations. 

To demonstrate how the theory developed in this paper can be 
used to derive CI's for genetic indicators that are a function of the 
population allele frequencies, the >95% CR's are used to 
calculate a >95% CI for Jost's D for the CINX1 locus and the 
Prasto and Finstrom butterfly populations. Jost's D is a measure of 
genetic distance between populations [9]. For the case of two 
populations, it is given by 



D jost — 2 



(15) 



where, using the notation in this example, 



JT = ±(^) 2 =\^ + q*) 2 + \±(p, + q l ) 



(16) 



and 



Js = 



; = 1 1=1 
2 



(17) 



+*EW+«?)- 



■ 1=1 



A lower limit to a >95% CI for Dj os t can be derived by 
minimizing the function in (15) given the constraints that the six 
population allele frequencies are contained within the two 
corresponding >95% CR's, and the two constraints 

1 — E= 1 Pi = P* ^ 0 and 1 — J2^= 1 1i = ?4 > 0. This lower limit 
is denoted by Lb. Similarly, the upper limit to a >95% CI for 
Z)j ost can be derived by maximizing the function in (15) under the 
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same constraints. This is denoted by Ud- In this study, the 
"Minimize" and "Maximize" functions in Mathematka v5.0 [31] 
are used to compute Ljy and Up, but corresponding functions may 
be used in other software packages, such as the "solnp" function in 
the "Rsolnp" R package. A >95% CI for Z)j ost can then be 
defined as the interval [Lp, Ud]- Supporting Webpage 1 provides 
Mathematka code that can be used to calculate [Ld, Ud] for the 
butterfly case study examined (http://rpubs.com/kkeenan02/ 
Fung-Keenan-Mathematica/). Corresponding R code is presented 
on Supporting Webpage 2 (http://rpubs.com/kkeenan02/Fung- 
Keenan-R/). 

Results 

Maximum length of >95% confidence interval with 
increasing sample size 

For Scenario 1 , where the sampled diploid population of size 
M= 1,000 was at HWE, the maximum length of the >95% CI 
for the population frequency of allele Aj, />,, considering all 
possible observed values of the sample allele frequency, 
decreased non-linearly with sample size iV (Figure 1). The 
minimum N required to achieve a maximum length <0.2, 
N<o.2, was 22, whereas the minimum N required to achieve a 
length <0.1, N< 0 .l, was 49 (Figure 1). 

For given N and pi,N, the CI for />, consists of all possible values 
of pi for which p, yy lies in the corresponding acceptance region, as 
described in Methods. This acceptance region is expected to 



0.7 



C£T 0.6 
o 



O 0.5 



t 1 r 



Scenario 1 : HWE 

Scenario 2: Min homozygosity 

Scenario 3: Max homozygosity 
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Figure 1. Change in maximum length of >95% confidence 
interval with increasing sample size. Graph showing how the 
maximum length of the >95% confidence interval (CI) for the 
population frequency of an allele Aj (p,-) changes with increasing 
sample size N, when sampling from a diploid population of size 
M= 1,000. For a given N, the maximum CI length was derived by 
calculating CI lengths for all possible values of the observed sample 
allele frequency and then taking the maximum length. The three curves 
correspond to three scenarios where the population is (1) at Hardy- 
Weinberg equilibrium (HWE), (2) attains its lowest homozygosity value 
with respect to Aj, and (3) attains its highest homozygosity value with 
respect to A,. For each curve, the two filled circles represent the 
minimum N values required for the maximum CI length to reach values 
of <0.2 and <0.1. For visual guidance, the two dashed horizontal lines 
mark maximum CI lengths of 0.2 and 0.1. The exact values of N tested 
are described in Methods, and equation (1 3) was used to calculate the CI 
lengths. 

doi:10.1371/journal.pone.0085925.g001 




Observed value of p 

Figure 2. Change in length of >95% confidence interval across 
different observed values, under Hardy-Weinberg equilibrium. 

For a sample of size N taken from a population of size M = 1,000 at 
Hardy-Weinberg equilibrium, graph showing the length of the >95% 
confidence interval (CI) for the population frequency of an allele A t (p,-) 
across all possible observed values of the sample allele frequency (p,yy). 
The three curves correspond to 7V= 10, 30 and 50. Equation (13) was 
used to calculate the length of each CI. 
doi:10.1371/journal.pone.0085925.g002 

increase with the variance ofp,yy, a ]pi,N\- At HWE, the frequency 
of homozygotes of allele Aj, P u , is equal to p 2 ; thus, according to 
equation (11), a 2 \p^fi\ takes its highest values at intermediate 
values of p,. As a result, intermediate values of />,yy are likely to fall 
within the acceptance regions of more values of />,, such that the 
corresponding CI lengths are generally longer. Simulation results 
for Scenario 1 were broadly in agreement with these theoretical 
expectations (Figure 2). 

In Scenario 2, where the population was no longer at HWE but 
attained its lowest Pa, the maximum CI length also decreased non- 
linearly with increasing N (Figure 1). Compared with Scenario 1, 
N <o.2 was slighdy larger and N<o\ was larger by about a factor of 
two, taking values of 26 and 94 respectively (Figure 1). This is 
contrary to expectations that minimum homozygosity would give 
smaller iV<o.2 and iV<o.i (see Methods). The reason is that although 
a \pi,N\ is smaller for given JV and />, compared with the case of 
HWE, which would be expected to result in fewer pi values for 
which a given /),yy falls within the corresponding acceptance 
regions and hence a shorter CI, there are more possible values of 
Pi (2,001 compared with 11 - see Methods), which increased the 
number of pi values for which p t yy falls within the corresponding 
acceptance regions. For Scenario 2, P,, =Max{0, 2/>, — 1}, such 
that o 2 \pjfl] attains its highest values at Pi values close to 0.25 and 
0.75 (equation (1 1)). Thus, for given N, values ofp,yy near 0.25 and 
0.75 are expected to generally exhibit the longest CI lengths. 
Simulation results for Scenario 2 were broadly in agreement with 
these theoretical expectations (Figure 3). 

For the last scenario, Scenario 3, the population attained its 
highest Pu, representing the opposite extreme to Scenario 2. As for 
the previous two scenarios, the maximum CI length decreased 
non-linearly with increasing TV (Figure 1), but this time, A<o.2 and 
7V<o.i took values of 94 and 285 respectively. These values were 
approximately four and six times as large as the corresponding 
values in Scenario 1 (Figure 1). In Scenario 3, Pu=Pi, and 
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Figure 3. Change in length of >95% confidence interval across 
different observed values, under minimum homozygosity. For a 

sample of size N taken from a population of size M= 1,000 with the 
minimum homozygosity possible for an allele A h graph showing the 
length of the >95% confidence interval (CI) for the population 
frequency of At (pi) across all possible observed values of the sample 
allele frequency (pyv). The three curves correspond to N — 10, 30 and 
100. Equation (13) was used to calculate the length of each CI. 
doi:1 0.1 371 /journal.pone.0085925.g003 



Figure 4. Change in length of >95% confidence interval across 
different observed values, under maximum homozygosity. For 

a sample of size N taken from a population of size M = 1,000 with the 
maximum homozygosity possible for an allele A„ graph showing the 
length of the >95% confidence interval (CI) for the population 
frequency of A, (pi) across all possible observed values of the sample 
allele frequency (/>yv). The four curves correspond to N — IQ, 30, 100 
and 200. Equation (13) was used to calculate the length of each CI. 
doi:10.1371/journal.pone.0085925.g004 



equation (11) shows that intermediate values of pi give the highest 
values of a [pj,iv]- Therefore, as in Scenario 1, intermediate values 
of pi t M are expected to generally exhibit the longest CI lengths for 
given N. Again, simulation results were consistent with these 
expectations (Figure 4). 

Maximum length of >95% confidence interval with 
increasing population size 

When taking a sample of size N = 30 from a diploid population 
at HWE, the maximum length of the >95% CI for p it across all 
Pi,N, increased with population size M (Figure 5). As M was 
increased from a small value of 100 to 1,000, the maximum CI 
length remained the same at 0.2 (this is possible because there is 
nothing to prevent different values of M giving the same maximum 
CI length, using equation (13)). The maximum CI length increased 
when M was increased from 1,000 to a large value of 2,500, but 
only by 0.04. Thereafter, the length remained constant up to a 
very large value of M = 7,500, and only increased by a small 
amount of 0.02 with a further increase in M to 10,000. Thus, 
simulation results confirm the theoretical expectation that CI 
length increases with M (as explained in Methods - this expectation 
arises because a 1 \pi,N\ increases with M, according to equation 
(11)). However, the increase in the CI length was modest as M was 
increased over two orders of magnitude. 

> 95% confidence interval for Jost's D between two 
checkerspot butterfly populations 

For the Prasto checkerspot butterfly population, the >98.3% 
CFs for the population frequencies for three of the four alleles at 
the CLNX1 locus were computed using equation (13) and data from 
Palo et al. [13], as described in Methods. The three alleles are 
denoted by A\, A2 and A3 respectively, with the population 
frequencies denoted by p\, P2 and pi, respectively. The three 
corresponding >98.3% CPs derived were [0.002,0.080], 



[0.598,0.872] and [0.114,0.381] respectively. Using the same 
methodology, for the Finstrom population, the > 98.3% CFs for 
the population frequencies of Ai, A2 and A3 were calculated 
as[0.003, 0.109], [0.714, 0.914] and [0.003, 0.109] respectively. 



0.28 1 , 1 , 1 , 1 , 1 r 




Population size, M 



Figure 5. Change in maximum length of >95% confidence 
interval with increasing population size. Graph showing how the 
maximum length of the >95% confidence interval (CI) for the 
population frequency of an allele Ai (pi) changes with increasing 
population size M, when taking samples of size TV = 30. The population 
is at Hardy-Weinberg equilibrium. For a given M, the maximum CI 
length was derived by calculating CI lengths for all possible values of 
the observed sample allele frequency and then taking the maximum 
length. M values of 100, 250, 500, 750, 1,000, 2,500, 5,000, 7,500 and 
10,000 were tested, as indicated by the filled circles. 
doi:10.1371/journal.pone.0085925.g005 
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The three > 98.3% CPs for the Prasto population were used to 
form a cubic region, which corresponds to a > 95% CR for A\ , A2 
and A3 [34]. In the same way, the three > 9 8 . 3 % CI's for the 
Finstrom population were used to form a >95% CR for A\, A2 
and A3. Within these two CR's, the maximum and minimum 
values of fj os t (-Djost is given by equation (15)) were calculated and 
used to derive a >95% CI for -Dj os t, as described in Methods. This 
CI for Z> Jost was derived as [2.34 x 10 -5 , 0.186] . If D Jost was 
calculated simply using the sample allele frequencies and equation 
(15), then only one value would have been obtained: 0.043. 

Discussion 

In scientific studies that use sample data to estimate unknown 
population parameters, the sampling uncertainty in the estimates 
needs to be quantified in order to make reliable inferences on 
population processes captured by the parameters. This forms an 
essential part of scientific hypothesis-testing. Therefore, in studies 
of population genetics, it is essential to quantify the sampling 
uncertainty of key population parameters used to infer past and 
present evolutionary processes. These include allele frequencies, 
which are often used to quantify genetic variation among 
populations, thereby allowing hypotheses on processes driving this 
variation to be tested (e.g., [1,2,3,4,5]). However, many studies do 
not include sampling uncertainty for allele frequencies, instead 
presenting and/or using single point estimates based on one 
sample per population (e.g., [1 1,12,13,14,15,16,17,18]). Thus, it is 
not possible to assess the accuracy of any inferences from these 
studies. This hinders not only the advance of scientific knowledge 
but also decision-making based on this knowledge, such as for 
sustainable management and conservation of natural resources. In 
this context, the work presented in this paper is valuable in that it 
provides a method of quantifying sampling uncertainty in allele 
frequencies for diploid populations, in the form of confidence 
intervals (CI's) containing true values with probability equal to or 
greater than a desired threshold. 

The method presented pertains to the general case of a locus 
with n alleles, with a sample of size N taken from a population of 
size M and any degree of homozygosity with respect to the n 
alleles. In this case, the method allows construction of a CI for the 
population frequency of each allele, which can then be combined 
to create a joint confidence region (CR) for all the population allele 
frequencies at a given locus. It is noted that if more than one locus 
is considered simultaneously, then a joint CR for all population 
allele frequencies at all loci can be calculated by combining CI's in 
an analogous way. For the subcase of an infinite population size 
(M->oo) and a large sample size of JV>30, Weir [10] had 
proposed an approximate 100(1— 2a) % CI for the population 
allele frequency of an allele Aj, />,. This is 
[p,yv — z i -a5'[Pi,Jv] i Pi.N — z x^\Pi,N] ] , where pi t ff is the sample allele 
frequency; z a satisfies <$>(z x ) = a, with fl> being the cumulative 
distribution function (cdf) of the standard Normal distribution; and 
g\Pi,n] is an estimate of the standard deviation of/?/^ using sample 
data. d\pifl] is specified by equation (1 1) with p,jf replacing p t and 
Pu,N> the sample frequency of homozygotes with allele A,, 
replacing P„, the corresponding population frequency of homo- 
zygotes. However, the accuracy of this CI depends both on how 
close o\pi,N] is to the true standard deviation <r[p,,jv] (specified by 
equation (11)) and how close the cdf of p^N is to a Normal 
distribution with the same mean and variance. This accuracy has 
not been quantified [10] and thus, it is not known whether the CI 
actually contains />, with a probability of at least 100(1— 2a), 
rendering its use problematic. In this study, we have rectified this 
problem by constructing a CI for />, with probability coverage of at 



least 100(1 —2a), for the more general case where the population 
size can take any value larger than or equal to the sample size. 

The method we derived was used to show that the sampling 
uncertainty in measured as the maximum length of the >95% 
CI for pi across all possible values of the observed sample allele 
frequency pi,N, decreased non-linearly with TV when sampling from 
a large population (M = 1,000) under three archetypal scenarios. 
These three scenarios represent the cases where the population (1) 
is at Hardy- Weinberg equilibrium (HWE), (2) has the lowest value 
of Pa and (3) has the highest value of P„ . As expected from theory, 
for any given jV, the maximum CI lengths for Scenario 3 were 
always greater than corresponding lengths in Scenarios 1 and 2. 
However, the maximum CI lengths for Scenario 2 was unexpect- 
edly greater than that in Scenario 1 for some values of N, reflecting 
the greater possible number of values pi can take in a finite 
population with minimum homozygosity compared to one at 
HWE. This illustrates how the finite size of a population can give 
opposite trends to those obtained under an assumption of infinite 
size. According to theory and simulations, Scenario 3 gives CI 
lengths that closely approximate those in the case where Pa is 
unknown (see Methods). Thus, if P„ is unknown, CI lengths derived 
under Scenario 3 should be used. This was the approach used in 
the application of our method to sample data for two butterfly 
populations, discussed further below. On the other hand, if there is 
evidence that the population is at HWE, then the shorter CI's 
derived under Scenario 1 could be used. 

Under the three scenarios examined, the non-linear decreases in 
sampling uncertainty with increasing N are consistent with results 
from the simulation study of Hale et al. [19], who found that the 
average difference between pt^ and Pi also exhibited non-linear 
decreases with N. However, Hale et al. [19] did not use their 
simulation results to quantify sampling uncertainty for the realistic 
situation where only one sample is taken from a population; this 
situation was considered in our study. Furthermore, as mentioned 
in the Introduction, for given N, they only used 100 samples to 
numerically construct the distribution for Pi,N, resulting in an 
incomplete distribution that may not closely reflect the true 
distribution. This highlights a weakness of a simulation-based 
approach without a rigorous mathematical underpinning, which is 
present in our approach. Hale et al. [19] concluded that N= 25- 
30 is sufficient to give accurate estimates of />,-, but this conclusion 
has to be interpreted in light of the limitations identified. Our 
results refine this conclusion by showing that across the three 
scenarios examined, N= 49-285 is required to ensure that, with a 
high probability of >0.95, an estimate for p t can be derived from 
any one sample that is within 0.05 of the true value; this 
corresponds to a CI of length <0.1. To ensure that the estimate is 
within 0.1 rather than 0.05, corresponding to a CI of length <0.2, 
our results show that ./V= 22-94 is required. Thus, ./V = 30 is not 
guaranteed to give "accurate" estimates of pi under all or most 
scenarios, and N values up to 10 times larger could be required. 
Decreasing the population size M from 1,000 would help to 
decrease sampling uncertainty, but results showed that decreasing 
Mover two orders of magnitude from 10,000 to 100 only resulted 
in modest decreases in the maximum CI length of <0.06, with no 
decrease when M was decreased from 1,000 to 100. Thus, the 
overall conclusion is that N = 30 is often insufficient to guarantee 
accurate estimates of/;,, in the sense that p, is within 0.05 or 0.1 of 
the estimate. Considering that alleles at highly polymorphic loci, 
such as microsatellites, often occur at population frequencies of 
<0.05 [35], it might be desirable to derive CI's for p t that are of 
length <0.1. Thus, sample sizes even larger than the values found 
in our simulations might be required under some circumstances. 
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The application of our method to empirical data for two 
populations of the checkerspot butterfly [13] demonstrated how 
the underlying theory can be applied to construct joint >95% 
CR's for the population frequencies of multiple alleles at a single 
locus. These CR's were then used to construct a >95% CI for 
Jost's D, which measures genetic differentiation between the two 
populations. This illustrates how our method can be used to 
quantify sampling uncertainty in genetic indicators that are a 
function of population allele frequencies, thus facilitating hypoth- 
esis-testing and also risk-based natural resource management. In 
the example considered, the single point estimate of Jost's D using 
the sample allele frequencies was about four times lower than the 
upper bound of its >95% CI. Thus, use of the single point 
estimate without accounting for sampling uncertainty could lead to 
misleading conclusions. The effectiveness of any management 
measures based on such conclusions would be compromised, 
hindering the achievement of objectives related to conservation or 
sustainable use. Our example therefore highlights the practical 
utility of the method that we have derived. 

In conclusion, we have presented a rigorous mathematical 
method for quantifying sampling uncertainty in estimates of 
population allele frequencies, for a general case that has hitherto 
not been analyzed. In addition, we have demonstrated its practical 
application in informing sampling design and determining 
uncertainty in genetic indicators. Thus, the method derived 
advances both theory and practice, with broad implications for a 
range of disciplines, including: conservation genetics, evolutionary 
genetics, genetic epidemiology, genome wide association studies 
(GWAS), forensics and medical genetics. In particular, the method 
provides exact answers to the question of how many individuals 
need to be sampled from a population in order to achieve a given 
level of accuracy in estimates of population allele frequencies. This 
is a question that has rarely been studied before [19], despite its 
important practical implications. Previous studies have derived 
sample sizes required to sample, with high probability, at least one 
copy of all alleles at a locus above a given frequency [35,36,37], 
but these do not correspond to sample sizes required to achieve 
accurate estimates of population allele frequencies. Derivation of 
the latter requires explicit quantification of sampling uncertainty, 
as we have done in this study. 

Possible future extensions 

The CI's and CR's constructed using our method are 
conservative in the sense that they contain the true values with a 
probability equal to or greater than a desired threshold. This 
conservative property is useful in hypothesis-testing if there is a 
need to decrease the probability of obtaining a false positive below 
a certain threshold. However, researchers would ideally like to 
construct CI's and CR's containing true values with a known 
probability, not just with probability at or above a known 
threshold. Therefore, future research could attempt to tighten 
the intervals and regions that we have derived, ideally until they 
cover a known probability. For example, Cai and Krishnamoorfhy 
[23] devised a method (using their "Combined Test" approach) 
that was shown to give shorter CI's for the probability parameters 
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